Improve HTTP response handling and retry logic#520
Improve HTTP response handling and retry logic#520MichaelGHSeg wants to merge 18 commits intomasterfrom
Conversation
- Add Authorization header with Basic auth (base64 encoded write key) - Add X-Retry-Count header on all requests (starts at 0) - Implement Retry-After header support (capped at 300s) - Retry-After attempts don't count against backoff retry budget - Add granular status code classification: - Retryable 4xx: 408, 410, 429, 460 - Non-retryable 4xx: 400, 401, 403, 404, 413, 422 - Retryable 5xx: all except 501, 505 - Non-retryable 5xx: 501, 505 - Replace backoff decorator with custom retry loop - Exponential backoff with jitter (0.5s base, 60s cap) - Clear OAuth token on 511 Network Authentication Required - 413 Payload Too Large is non-retryable - Add 30 new comprehensive tests (106 total tests) Aligns with analytics-java and analytics-next retry behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with analytics-java change to accommodate shorter backoff periods (0.5s base, 60s cap). With faster retries, a higher retry limit allows for better resilience during extended outages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves HTTP response handling and retry logic in the analytics-python client, aligning with changes from analytics-java and analytics-next. The implementation replaces the backoff decorator with a custom retry loop that provides fine-grained control over retry behavior, distinguishing between Retry-After attempts and exponential backoff attempts.
Changes:
- Replaced backoff decorator with custom retry loop implementing exponential backoff with jitter (0.5s base, 60s cap)
- Added Authorization header support (Basic auth with write key and OAuth Bearer token)
- Added X-Retry-Count header to track attempt numbers
- Implemented Retry-After header support (capped at 300s, doesn't count against retry budget)
- Granular status code classification distinguishing retryable from non-retryable errors
- Increased default max_retries from 10 to 1000 to accommodate faster retry cadence
- OAuth token clearing expanded to include 511 status code
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| segment/analytics/request.py | Added Authorization header (Basic/Bearer), X-Retry-Count header, parse_retry_after() function, and response object to APIError |
| segment/analytics/consumer.py | Replaced backoff decorator with custom retry loop implementing exponential backoff, Retry-After support, and granular status code classification |
| segment/analytics/client.py | Increased default max_retries from 10 to 1000 |
| segment/analytics/test/test_request.py | Added 17 comprehensive tests for authorization headers, X-Retry-Count, Retry-After parsing, and OAuth token clearing |
| segment/analytics/test/test_consumer.py | Added 13 comprehensive tests for retry logic, status code classification, backoff behavior, and Retry-After support |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Extract duplicate exponential backoff calculation into helper function - Add upper bound (max_total_attempts) to prevent infinite retry loops with Retry-After - Improves code maintainability and prevents edge case of continuous Retry-After responses Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After) - Add rate-limit state to Consumer (rate_limited_until, rate_limit_start_time) - 429 with Retry-After: set rate-limit state, raise to caller for requeue - 429 without Retry-After: counted backoff (not pipeline blocking) - Add maxTotalBackoffDuration / maxRateLimitDuration config (default 43200s) - upload() checks rate-limit state before request(), enforces duration limit - 511 OAuth gating: only retry when OauthManager is configured - Add tests: T04, T17, T19, T20; update 429/408/503 behavior tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Handle Retry-After: 0 correctly by checking 'is not None' instead of truthiness. Prevent silent batch re-queue on 429 without Retry-After by gating the upload() re-queue path on rate_limited_until being set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
segment/analytics/consumer.py:158
- When a 429 with Retry-After is received (lines 136-144), the batch items are re-queued using
queue.put(), but then in the finally block (lines 156-158), all items in the batch are marked astask_done(). This means the items that were re-queued are still marked as done, which could causequeue.join()to return prematurely before those re-queued items are actually processed. Consider either not callingtask_done()for re-queued items, or handling the re-queueing differently.
if e.status == 429 and self.rate_limited_until is not None:
# 429: rate-limit state already set by request(). Re-queue batch.
self.log.debug('429 received. Re-queuing batch and halting upload iteration.')
for item in batch:
try:
self.queue.put(item, block=False)
except Exception:
pass # Queue full, item lost
success = False
else:
self.log.error('error uploading: %s', e)
success = False
if self.on_error:
self.on_error(e, batch)
except Exception as e:
self.log.error('error uploading: %s', e)
success = False
if self.on_error:
self.on_error(e, batch)
finally:
# mark items as acknowledged from queue
for _ in batch:
self.queue.task_done()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add KeyError to except clause when parsing JSON response to handle missing 'code' or 'message' keys - Add explanatory comment on pre-existing except-pass pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
requests.Response.__bool__() returns False for non-2xx status codes. The checks `if e.response` and `if response` evaluated to False for 429 responses, so parse_retry_after() was never called and the SDK fell back to normal backoff instead of respecting Retry-After. Changed both checks to explicit `is not None` comparisons. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add on_error handler to capture delivery failures from the SDK. Reports success=false with the first error message when any batch fails (non-retryable error or retries exhausted). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align retry configuration with cross-library defaults. Base backoff (500ms) and max backoff (60s) already matched. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First request (retry_count=0) no longer includes the header. Retries with retry_count > 0 continue to send X-Retry-Count: 1, 2, 3, etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with the analytics-next Node reference implementation which treats all 200-399 responses as successful delivery. Previously only exact 200 was treated as success, causing 201/204/3xx to be misreported as errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Resolve python via activated venv/devbox, fall back to python3 - Use $PYTHON -m pip instead of bare pip (fixes macOS/nix where pip is not on PATH) - Add devbox setup as recommended path in README, with venv fallback instructions
Deep Code Review —
|
| Gap | Severity |
|---|---|
max_total_backoff_duration=None — triggers TypeError in consumer |
Critical (matches issue #2 above) |
max_rate_limit_duration=None — triggers TypeError in consumer |
Critical (matches issue #2 above) |
| Queue full during 429 re-queue — items silently dropped | Important (matches issue #5) |
Retry-After: 0 tight-loop behavior |
Important (matches issue #4) |
Retry-After with HTTP-date format falls back to counted backoff |
Nice-to-have |
Multi-consumer (thread=2) rate-limit state is not shared |
Nice-to-have |
sync_mode has no retry logic — failures propagate directly |
Documentation gap |
retries=0 with a retryable (non-429) error |
Minor |
The happy-path and primary retry scenarios are very well covered. The new test suite is a significant improvement.
Positive Notes
- The manual retry loop replacing
backoff.expogives much better control over retry semantics. The dual-budget model (per-attempt count + wall-clock duration) is solid. - Separating 429+Retry-After (pipeline blocking) from 429-without-Retry-After (counted backoff) is the right design.
X-Retry-Countheader omitted on first attempt is correct — servers can use presence of the header to detect retries.parse_retry_aftercapping at 300s is a good safety valve.- The
on_errorcallback hookup ine2e-cliis a real improvement for observability in e2e tests. - The Basic auth migration from
HTTPBasicAuthto explicitbase64is fine (equivalent behavior, removes one import).
Summary
Implements improved HTTP response handling and retry logic to align with changes in analytics-java and analytics-next:
Authorizationheader (Basic auth with write key, OAuth Bearer token)X-Retry-Countheader on all requests (starts at 0)Retry-Afterheader support (capped at 300s, doesn't count against retry budget)backoffdecorator with custom retry loop for better controlStatus Code Handling
Retryable 4xx: 408, 410, 429, 460
Non-retryable 4xx: 400, 401, 403, 404, 413, 422, and all other 4xx
Retryable 5xx: All except 501, 505
Non-retryable 5xx: 501, 505
Retry-After Behavior
Test Coverage
Changes
Modified Files
segment/analytics/consumer.py: Custom retry loop, status code classification, Retry-After supportsegment/analytics/request.py: Authorization header, X-Retry-Count header, parse_retry_after()segment/analytics/client.py: Increased default max_retries to 1000segment/analytics/test/test_consumer.py: 13 new test casessegment/analytics/test/test_request.py: 17 new test casesAlignment
This implementation aligns with:
🤖 Generated with Claude Code